{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 使用 LinearRNA 进行 RNA 二级结构预测"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "LinearRNA 包括一系列的线性时间 RNA 二级结构分析算法：**LinearFold** 和 **LinearPartition**。关于这个主题的更多信息请查阅[这里](https://github.com/PaddlePaddle/PaddleHelix/c/pahelix/toolkit/linear_rna/README_cn.md)。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## RNA 二级结构预测"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**LinearFold** 是第一个以线性时间预测 RNA 二级结构的算法，可将 RNA 二级结构预测的时间大大降低。LinearFold 的论文已经在计算生物学顶级会议 ISMB 及生物信息学权威杂志 Bioinformatics 上发表。论文链接：[LinearFold: linear-time approximate RNA folding by 5'-to-3' dynamic programming and beam search](http://academic.oup.com/bioinformatics/article/35/14/i295/5529205)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 第一部分：LinearFold\n",
    "首先，我们需要安装pahelix依赖："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "安装成功\n"
     ]
    }
   ],
   "source": [
    "!pip install paddlehelix\n",
    "from IPython.display import clear_output\n",
    "clear_output()\n",
    "print(\"安装成功\")"
   ]
  },
  {
   "source": [
    "## 机器学习模型\n",
    "```bash\n",
    "linear_fold_c(rna_sequence, beam_size = 100, use_constraints = False, constraint = \"\", no_sharp_turn = True)\n",
    "```\n",
    "使用 [CONTRAfold](https://pubmed.ncbi.nlm.nih.gov/16873527/) 中的机器学习模型。"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 参数说明\n",
    "* rna_sequence: string，需要预测结构的 RNA 序列。\n",
    "* beam_size: int（默认值 100），控制 beam pruning size 的参数，设置为 0 关闭 beam pruning。该参数越大，则预测速度越慢，而与精确搜索相比近似效果越好。\n",
    "* use_constraints: bool（默认值 False），在预测二级结构时增加约束条件。当为 True 时，constraint 参数需要提供约束序列。\n",
    "* constraint: string（默认值空字符串），二级结构预测约束条件。当提供约束序列时，use_constraints 参数需要设置为 True。该约束须与输入的 RNA 序列长度相同，每个点位可以指定“? . ( )”四种符号中的一种，其中“?”表示该点位无限制，“.”表示该点位必须是 unpaired，“(”与“)”表示该点位必须是 paired。注意“(”与“)”必须数量相等，即相互匹配。具体操作请参考运行实例。\n",
    "- no_sharp_turn: bool（默认值 True），不允许在预测的 hairpin 结构中出现 sharp turn。"
   ]
  },
  {
   "source": [
    "### 返回值\n",
    "* tuple(string, double): 返回一个二元组, 第一个位置是结构序列，第二个位置是结构的folding free energy。"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "('..((((.(((....)))...))))....((((((............................))))))....',\n",
       " 0.4548597317188978)"
      ]
     },
     "metadata": {},
     "execution_count": 9
    }
   ],
   "source": [
    "import pahelix.toolkit.linear_rna as linear_rna\n",
    "input_sequence = \"AACUCCGCCAGGCCUGGAAGGGAGCAACGGUAGUGACACUCUCUGUGUGCGUAGGUUGCCUAGCUACCAUUU\"\n",
    "linear_rna.linear_fold_c(input_sequence)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "('..(.(((......)((........))(((......(.......).))).....))..)..............',\n",
       " -27.328358240425587)"
      ]
     },
     "metadata": {},
     "execution_count": 10
    }
   ],
   "source": [
    "# with constraints\n",
    "constraint = \"??(???(??????)?(????????)???(??????(???????)?)???????????)??.???????????\"\n",
    "linear_rna.linear_fold_c(input_sequence, use_constraints = True, constraint = constraint)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 热力学模型\n",
    "```bash\n",
    "linear_fold_v(rna_sequence, beam_size = 100, use_constraints = False, constraint = \"\", no_sharp_turn = True)\n",
    "```\n",
    "使用 [Vienna RNAfold](https://almob.biomedcentral.com/articles/10.1186/1748-7188-6-26) 中的热力学自由能模型。\n",
    "\n",
    "参数和返回值与机器学习模型一致。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "('..((((.(((....)))...))))....((((((.((((.....))))...((((...))))))))))....',\n",
       " -18.4)"
      ]
     },
     "metadata": {},
     "execution_count": 11
    }
   ],
   "source": [
    "linear_rna.linear_fold_v(input_sequence)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "('..(.(((......)((........))(((......(.......).))).....))..)..............',\n",
       " 13.4)"
      ]
     },
     "metadata": {},
     "execution_count": 12
    }
   ],
   "source": [
    "# with constriants\n",
    "linear_rna.linear_fold_v(input_sequence, use_constraints = True, constraint = constraint)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 第二部分：LinearPartition"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**LinearPartition** 是世界上最快的 RNA 配分方程和碱基对概率预测算法。该算法功能更加强大，可以模拟 RNA 序列在平衡态时成千上万种不同结构的分布，并预测碱基对概率矩阵。LinearPartition 算法同样被 ISMB 顶会接收并在 Bioinformatics 杂志上发表。论文链接：[LinearPartition: linear-time approximation of RNA folding partition function and base-pairing probabilities](https://academic.oup.com/bioinformatics/article/36/Supplement_1/i258/5870487)"
   ]
  },
  {
   "source": [],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 机器学习模型\n",
    "```bash\n",
    "linear_partition_c(rna_sequence, beam_size = 100, bp_cutoff = 0.0, no_sharpe_turn = True)\n",
    "```\n",
    "`linear_c`：使用 [CONTRAfold](https://pubmed.ncbi.nlm.nih.gov/16873527/) 中的机器学习模型。\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 参数说明\n",
    "* rna_sequence: string，需要计算配分函数和碱基对概率的RNA序列。\n",
    "* beam_size: int（默认值 100），控制 beam pruning size 的参数，默认值为 100。该参数越大，则预测速度越慢，而与精确搜索相比近似效果越好。\n",
    "* bp_cutoff: double（默认值 0.9），只输出概率大于等于 bp_cutoff 的碱基对及其概率，0 <= bp_cutoff <= 1。\n",
    "* no_sharp_turn: bool（默认值 True），不允许在预测的 hairpin 结构中出现 sharp turn。\n",
    "\n",
    "### 返回值\n",
    "* tuple(string, list): 返回一个二元组，第一个位置是配分函数值，第二个位置是存有碱基对及其概率的列表。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "(0.6399469375610352,\n",
       " [(4, 13, 0.2007068395614624),\n",
       "  (10, 22, 0.24661558866500854),\n",
       "  (11, 21, 0.2457289695739746),\n",
       "  (12, 20, 0.20926791429519653)])"
      ]
     },
     "metadata": {},
     "execution_count": 13
    }
   ],
   "source": [
    "input_sequence = \"UGAGUUCUCGAUCUCUAAAAUCG\"\n",
    "linear_rna.linear_partition_c(\"UGAGUUCUCGAUCUCUAAAAUCG\", bp_cutoff = 0.2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 热力学模型\n",
    "```bash\n",
    "linear_partition_v(rna_sequence, beam_size = 100, bp_cutoff = 0.0, no_sharpe_turn = True)\n",
    "```\n",
    "使用 [Vienna RNAfold](https://almob.biomedcentral.com/articles/10.1186/1748-7188-6-26) 中的热力学自由能模型。参数和返回值与机器学习模型一致。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "(-1.9573111534118652,\n",
       " [(2, 15, 0.833134651184082),\n",
       "  (3, 14, 0.8365526795387268),\n",
       "  (4, 13, 0.8355389833450317)])"
      ]
     },
     "metadata": {},
     "execution_count": 14
    }
   ],
   "source": [
    "linear_rna.linear_partition_v(input_sequence, bp_cutoff = 0.5)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "name": "python3",
   "display_name": "Python 3.7.9 64-bit ('pahelix': conda)",
   "metadata": {
    "interpreter": {
     "hash": "ffeddd7d1ba530e12d9eb06e10b8b56b9c925e18ed1aeb11a97f0d6e21c5108a"
    }
   }
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.9-final"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}